NSF PAR Search | NSF Public Access Repository

Noise-Robust Speech Emotion Recognition Using Shared Self-Supervised Representations with Integrated Speech Enhancement

https://doi.org/10.1109/ICASSP49660.2025.10887569

Tzeng, Jing-Tong; Leem, Seong-Gyun; Salman, Ali N; Lee, Chi-Chun; Busso, Carlos (April 2025, IEEE)

Recent studies have demonstrated the effectiveness of fine-tuning self-supervised speech representation models for speech emotion recognition (SER). However, applying SER in real-world environments remains challenging due to pervasive noise. Relying on low-accuracy predictions due to noisy speech can undermine the user’s trust. This paper proposes a unified self-supervised speech representation framework for enhanced speech emotion recognition designed to increase noise robustness in SER while generating enhanced speech. Our framework integrates speech enhancement (SE) and SER tasks, leveraging shared self-supervised learning (SSL)-derived features to improve emotion classification performance in noisy environments. This strategy encourages the SE module to enhance discriminative information for SER tasks. Additionally, we introduce a cascade unfrozen training strategy, where the SSL model is gradually unfrozen and fine-tuned alongside the SE and SER heads, ensuring training stability and preserving the generalizability of SSL representations. This approach demonstrates improvements in SER performance under unseen noisy conditions without compromising SE quality. When tested at a 0 dB signal-to-noise ratio (SNR) level, our proposed method outperforms the original baseline by 3.7% in F1-Macro and 2.7% in F1-Micro scores, where the differences are statistically significant.

Free, publicly-accessible full text available April 6, 2026

Search for: All records